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Abstract — Manifold models provide low-dimensional represen- 
tations that are useful for processing and analyzing data in 
a transformation-invariant way. In this paper, we study the 
problem of learning smooth pattern transformation manifolds 
from image sets that represent observations of geometrically 
transformed signals. In order to construct a manifold, we build 
a representative pattern whose transformations accurately fit 
various input images. We examine two objectives of the mani- 
fold building problem, namely, approximation and classification. 
For the approximation problem, we propose a greedy method 
that constructs a representative pattern by selecting analytic 
atoms from a continuous dictionary manifold. We present a 
DC optimization scheme that is applicable to a wide range 
of transformation and dictionary models, and demonstrate its 
application to transformation manifolds generated by the rota- 
tion, translation and anisotropic scaling of a reference pattern. 
Then, we generalize this approach to a setting with multiple 
transformation manifolds, where each manifold represents a 
different class of signals. We present an iterative multiple 
manifold building algorithm such that the classification accuracy 
is promoted in the learning of the representative patterns. 
Experimental results suggest that the proposed methods yield 
high accuracy in the approximation and classification of data 
compared to some reference methods, while the invariance to 
geometric transformations is achieved due to the transformation 
manifold model. 

Index Terms — Manifold learning, pattern transformation man- 
ifolds, pattern classification, transformation-invariance, sparse 
approximations. 

I. Introduction 

THE representation of high-dimensional signal sets with 
signal manifolds has several benefits. Manifold models 
provide concise and low-dimensional representations that fa- 
cilitate the treatment of signals. In the case of geometric 
transformation manifolds, the knowledge of the generating 
model provides a basis for the registration of signals. More- 
over, in a setting where different signal classes are represented 
with different manifolds, the class label of a query signal 
can be estimated by comparing its distance to the candidate 
manifolds. 

In this work, we focus on pattern transformation manifolds. 
A pattern transformation manifold (PTM) represents images 
that are generated from a reference pattern that undergoes a 
certain set of geometric transformations. For instance, the im- 
ages obtained by the rotation and scaling of a reference pattern 
form a PTM. Given a set of visual data that are assumed to be 
geometrically transformed observations of a signal, we address 
the problem of constructing a PTM that represents the data 
accurately. We assume that the type of the transformations 

E. Vural and P. Frossard are with Ecole Polytechnique Federale de Lausanne 
(EPFL), Signal Processing Laboratory - LTS4, CH-1015 Lausanne, Switzer- 
land, email: elif.vural@epfl.ch, pascal.frossard@epfl.ch. 

This work has been partly funded by the Swiss National Science Foundation 
under Grant 200020_132772. 



that generate the input images, i.e., the transformation model, 
is known. However, we do not assume any prior alignment of 
the input images; i.e., the individual transformation parameters 
corresponding to the images are to be computed. Under these 
assumptions, our manifold computing problem is formulated 
as the construction of a representative pattern, together with 
the estimation of the transformation parameters approximating 
the input signals. We consider a PTM model that is generated 
by smooth geometric transformations. We propose to build 
the representative pattern as a linear combination of some 
parametric atoms, which are waveforms that are adapted to 
the local structures of signals The atoms are selected 
from a continuous dictionary manifold that is formed by 
the smooth geometric transformations of an analytic mother 
function. The utilization of smooth and parametric atoms in 
the pattern construction brings desirable properties such as 
the smoothness of the PTM, and a parametric approximation 
of the input data, which is useful for effective description 
of data information. We study the PTM building problem in 
two parts, where we respectively address approximation and 
classification applications. 

In the data approximation part, we build on our previous 
work |2| and aim at obtaining an accurate transformation- 
invariant approximation of input images with the learned 
manifold. We iteratively construct a representative pattern 
by successive addition of atoms such that the total squared 
distance between the input images and the transformation 
manifold is minimized. The selection of an atom is then 
formulated as an optimization problem with respect to the 
parameters and the coefficient of the atom. We propose a 
two- stage solution for the atom selection, where we first 
estimate the parameters of a good atom and then improve this 
solution. In the first stage, we derive an approximation of the 
objective function (total squared distance) in a DC (Difference- 
of-Convex) form; i.e., in the form of a difference of two 
convex functions. We describe a procedure for computing 
this DC decomposition when a DC form of the geometrically 
transformed atom is known. The resulting DC approximation 
is minimized using a DC solver. Then, we refine the solution 
of the first stage with a gradient descent method where we 
approximate the manifold distance by the tangent distance in 
the objective function. Although our methodology in this paper 
is based on ideas very similar to those of |2|, we generalize 
the setting to arbitrary transformation manifolds, dictionary 
models and mother functions. In the derivation of the DC 
decomposition of the objective function, we use some results 
from 1 3 1, which however targets a different problem that is the 
alignment of a query image with a reference pattern. 

In the second part of our work, we extend this mani- 
fold building approach in order to explore transformation- 
invariant classification. We consider multiple sets of geo- 
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metrically transformed observations, where each set consists 
of a different class of images. We study the problem of 
constructing multiple PTMs such that each PTM represents 
one image class, and the images can be accurately classified 
with respect to their distances to the constructed PTMs. We 
propose an iterative method that jointly selects atoms for the 
representative patterns of all classes. We define an objective 
function that is a weighted combination of a classification 
and a data approximation error term. Then, we select atoms 
by minimizing a two- stage approximation of the objective 
function as in the first part. Experimental results indicate that 
the approaches proposed for single and multiple manifold 
computation perform well in transformation-invariant approx- 
imation and classification applications in comparison with 
baseline methods. 

The rest of the paper is organized as follows. In Section 
we give a review of related work. In Section III we 
discuss the manifold computation problem for transformation- 
invariant approximation of image signals. Then, in Section 
IV we present an extension of the proposed scheme for 
transformation-invariant classification. We discuss the com- 
plexity of the proposed methods in Section [V] Finally, we 
conclude in Section IVll 



II. Related Work 

Our study is linked to two main topics; manifold learning 
and sparse signal representations. Firstly, our PTM building 
approach can be seen as a special instance of manifold 
learning with prior information on the data model. Manifold 
learning refers to the recovery of low-dimensional structures 
in high-dimensional data. Many methods have recently been 
proposed in this field. The ISOMAP algorithm |4| computes 
a global parameterization of data based on the preservation 
of geodesic distances, while the LLE algorithm | 5 1 maps the 
data to a lower-dimensional domain using its locally linear 
structure. The Hessian Eigenmaps | 6 | algorithm has achieved 
some improvements on LLE, as it also involves some higher- 
order geometric characteristics of the data. However, such 
approaches have the following three main shortcomings. First, 
they compute a parameterization for the initially available data, 
and their generalization for the parameterization of additional 
data is not straightforward. A method has been proposed in 
I? I that provides out-of-sample extensions for some common 
manifold learning algorithms. The authors interpret these 
algorithms as learning the eigenvectors of a data-dependent 
kernel, and then generalize the eigenvectors to the contin- 
uous domain in order to compute eigenfunctions. Second, 
the aforementioned methods lack the means for synthesizing 
new samples that conform to the same manifold model. This 
observation is one of the motivations of the method presented 
in 0, which computes a smooth tangent field with the use of 
analytic functions and thus yields a smooth manifold structure 
that makes the generation of novel points possible. Also, 
in 1^ a method is proposed for synthesizing new images 
based on the LLE algorithm. Third, most of the methods 
that do not allow the synthesis of new data do not have 
inmiediate generalizations for classification applications. An 



exception is the work presented in |10|. The authors propose 
the SLLE algorithm, where LLE is modified such that the 
discrimination between different class samples is encouraged 
in the computation of the data embedding. 

All of the methods mentioned above are generic methods 
that make no assumption on the type of the manifold underly- 
ing the observed data. If they are applied on a data set sampled 
from a transformation manifold, the embedding computed 
with these generic methods does not necessarily reflect the 
real transformation parameters. Our learning algorithm differs 
from these methods essentially in the fact that it uses the 
information about the model generating the data, employs it 
for learning an accurate representation, and also computes the 
exact transformation parameter vectors. Since the manifold is 
constructed in a parametric form, the mapping between the 
parameter domain and the high-dimensional signal domain 
is perfectly known. Thus, one can generate new samples on 
the manifold and compute the parametrizations of initially 
unavailable data simply by finding their projections on the 
manifold. This also permits the estimation of the distance 
between a test image and the computed manifolds. Conse- 
quently, it is possible to assign class labels to test images in 
a transformation-invariant way by comparing their distances 
to the computed class-representative manifolds. Finally, as 
demonstrated by some of our experiments, the incorporation 
of the model knowledge into the manifold learning procedure 
brings important advantages such as robustness to data noise 
and sparse sampling of data, in comparison with generic 
methods based on local linearity assumptions. 

The method proposed in ifTTIl is related to our work in 
the sense that it computes a simultaneous alignment of a 
set of images that have undergone transformations, where the 
application of the method to classification problems is also 
demonstrated. However, their technique is essentially different 
from ours as it is based on the idea of "congealing" via the 
minimization of entropy in the corresponding pixels of aligned 
images. Next, our paper uses the idea of learning by fitting a 
parametric model to the data. It is possible to find several other 
examples of this kind of approach in the literature. For exam- 
ple, the article |[T2ll is a survey on locally weighted learning, 
where regression methods for computing linear and nonlinear 
parametric models are discussed. The efficient computation 
of locally weighted polynomial regression is the focus of 
|13|. Meanwhile, the method in 1 14 | applies locally weighted 
regression techniques to the appearance-based pose estimation 
problem. Then, we remark the following about the relation 
between this work and the field of sparse signal approxima- 
tions. Since we achieve a greedy construction of representa- 
tive patterns, our method bears some resemblance to sparse 
approximation algorithms such as Matching Pursuit (MP) UJ 
or Simultaneous Orthogonal Matching Pursuit (SOMP) [TSl . 
There are also common points between our method and the 
Supervised Atom Selection (SAS) algorithm proposed in (161, 
which is a classification-driven sparse approximation method. 
SAS selects a subset of atoms from a discrete dictionary by 
minimizing a cost function involving a class separability term 
and an approximation term. However, the main contributions 
of this work in comparison with such algorithms lie in 
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the following. Firstly, we achieve a transformation-invariant 
approximation of signals due to the transformation manifold 
model. Furthermore, we employ an optimization procedure 
for computing the atom parameters that provide an accurate 
approximation (or classification) of signals. This corresponds 
to learning atoms from a dictionary manifold, whereas meth- 
ods such as MP and SOMP pick atoms from a predefined 
discrete dictionary. This also suggests that it is possible to find 
connections between our work and transformation-invariant 
dictionary learning, where a sparse representation of signals is 
sought not only in terms of the original atoms but also in their 
geometrically transformed versions. So far, transformation- 
invariance in sparse approximations has been mostly studied 
for shift-invariance as in 1 17| and fW\, and for scale-invariance 
as in 1 19 L \20\. The work presented in [21 1 also achieves 
shift-invariance in the sparse decomposition via a continuous 
basis pursuit. Our new PTM learning method involves the 
formation of atoms that ensure invariance to a relatively wide 
range of geometric transformations in comparison with the 
above works. Our study may thus provide some insight into 
transformation-invariance in sparse approximations as well. 

III. Computation of PTMs for Signal 
Approximation 

A. Problem Formulation 

The PTM computation problem can be briefly explained 
as follows. Given a set of observations {ui}, we would like 
to compute a pattern p such that its transformation manifold 
M (p) (the set of geometrically transformed versions of p) fits 
the observations {ui}. Therefore, we look for a pattern p such 
that the total distance between M{p) and {ui} is minimized, 
which is illustrated in Figure [T] Now we define the problem 
formally. 

Let p e L^(R^) be a visual pattern, where L^(R^) denotes 
the set of square-integrable functions on R^. Let A C R^ 
be a closed parameter domain, and A G A be a parameter 
vector. We define Ax{p) G L^(R^) as the pattern that is 
generated by applying the geometric transformation specified 
by A to p. For instance, if A = (tx^ty) represents a 2-D 
translation, then Ax{p) corresponds to a translated version 
of p by (txjty). The relation between the two patterns is 
expressed as Ax{p){x^y) = p{x' ^y'), where the two pairs 
of coordinate variables are related as {x' ^y') = a{X,x,y). 
We assume that a is a smooth (C^) function. Also, defining 
a\{x^y) := a{X^x^y) for a fixed A G A, we assume 
that ttA : R^ ^ R^ is a bijection. Then, we define the 
transformation manifold of p a j^l 

M{p) = {Ux{p):Xe A} cR^, (1) 

^A4(p) is a Riemannian manifold with the Riemannian metric given by 
9ijW = (^^v^' ^^dx^^ where (, } denotes the usual inner product in 
and A^, \j denote the i-th and j-th transformation parameters. 

^In this paper, we demonstrate our method on the transformation mod- 
els flT) and {33) . The transformation manifold of the model in J33j is 
a transformation group called the similitude group, where the manifold 
A{p) = {Ax{p) : A G A} C L2(R2) (^^g counterpart of M{p) in the 
continuous space) corresponds to the group orbit of p. However, it should be 
noted that the model in fTT| does not correspond to a transformation group. 




Fig. 1. The set {ui} of geometrically transformed observations is approx- 
imated with the transformation manifold M.(jp) of a representative pattern 
p. 

where U\{p) G R^ is an n-dimensional discretization of 

Let ||.|| denote the ^2 -norm in R^. For a given G R^, let 
A = argmin^ y-u — /7)^(p)||. Then Ux{p) is called a projection 
ofuon M (p) . In this case, the distance between u and M (p) 
is given by 

d{u,M{p)) = \\u - Ux{p)\\ = min \\u - Ux{p)\\. (2) 

aga 

Let U = {ui}^-^ C R^ be a set of observations of a 
geometrically transformed visual signal. We would like to 
describe these observations as Ui = Ux^ip) + by the 
transformations Ux^ip) of a common representative pattern p, 
where the term indicates the deviation of Ui from M (p) . In 
the selection of p, the objective is to approximate the images 
in U accurately. We represent the approximation accuracy 
in terms of the distance of the input images to M{p). We 
formalize this problem as follows. 

Problem 1: Given images U = {ui}^^^, compute a pattern 
p G I/^(R^) and a set of transformation parameter vectors 
{^i}iLi ^ by minimizing 

N 

E = J2\Wi-UxAp)f. (3) 

The error E corresponds to the total squared distance of 
the input images to M {p) . In order to solve Problem [T] we 
propose to construct p as a sparse linear combination of some 
parametric atoms from a dictionary manifold 

V = {B^{^):jeT}cL\R^). (4) 

^When sampling Ax{p) to get Ux{p), we fix a rectangular window on R^, 
and a regular sampling grid once and for all. Note that defining the pattern 
transformations in the continuous space L^(R^) instead of R^, together with 
constructing p with parametric atoms in L^(R^), saves us from resampling 
and interpolation ambiguities. In other words, since the sampling grid is fixed 
and p(x, y), and thus Ax(p){x, y) are analytically known functions, there is 
a uniqu e way of generating Ux(p) for any A. Note also that, as explained in 
Section |III-B| our method does not require the transformation of the discrete 
input images {ui} C R^, i.e., {ui} are used as given, and transformations 
are always applied to p throughout the algorithm. A rectangular window for 
sampling is a suitable choice in our application, as the atoms typically have 
good time-localization (such as Gaussians), and so does p. Moreover, if the 
generation of atoms involves a scaling and translation of the mother function, 
which is often the case, the method has a natural adaptivity for different 
window sizes, window locations and sampling rates due to the scalability and 
position of the atoms, and therefore p. 
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Here, each atom B^{(j)) e L^(R^) is derived from the 
analytic mother function (j) G L^(R^) through a geometric 
transformation specified by a parameter vector 7. An 
atom is thus given by B^{(j)){x^y) = (j){x' ^y'), where 
{x' ^y') = 6(7, X, 7/). We assume that 6 is a smooth function, 
and that h^{x^y) := 6(7, x, 7/), \ ^ is a bijection 
for any fixed 7 G P. The parameter domain V is assumed 
to be a closed and convex subset of W for some 5, where 
5 is the number of transformation parameters generating V. 
Hence, V is an s-manifold. Let us write (j)^ = B^{(j)) for 
simplicity. We would like to obtain the representative pattern 
in the form p = Ylf=i ^3 ^ij ^ combination of K atoms 
with coefficients {cj}. Under these assumptions, we 
reformulate the previous problem as follows]^ 

Problem 2: Given images U = {ui}^-^, an analytic mother 
function (j), and a sparsity constraint K; compute a set of 
atom parameter vectors {7j}jLi C F, a set of coefficients 
{cjlj^i C M, and a set of transformation parameter vectors 
{Xi}^i C A, by minimizing 

N K 

E = Y,\\n^-U^.{Y.^Jh,W• (5) 

i=i 3=1 

Note that the construction of p with smooth atoms assures 
the smoothness of the resulting transformation manifold. A 
manifold point Ux{p) G is given by the discretization of 
the function 

K 

Ax{p){x,y) =p{ax{x,y)) = ^Cj (l)^^{ax{x,y)) 

(6) 

= X^Cj oax{x,y)) 

where the notation o stands for function composition. Here 
ax{x^y) is a smooth function of A; and b and (j) are smooth 
functions, too. Therefore, Ax{p){x^y) is a smooth function 
of A. Then, each component Ux{p){l) of Ux{p) is a smooth 
function of A, for / = 1, ... , n. 

B. PTM Building Algorithm 

We now build on our previous work fT\ and describe 
an algorithm for the solution of Problem [2] Due to the 
complicated dependence of E on the atom and projection 
parameters, it is hard to find an optimal solution for Problem 
[2] Thus, we propose here a constructive approach. We build 
the pattern p iteratively by selecting atoms from P in a greedy 
manner. Each successive version pj of the pattern p leads to 
a different manifold M{pj), whose form gradually converges 
to the the final solution M {p) . During the optimization of the 
atom parameters in each iteration, we first locate a good initial 

"^Whether the span of the dictionary V is dense in L^(R^) depends on 
the mother function as well as the transformation b. In this paper, we 
present results where V is generated by the rotation, anisotropic scaling and 
translation of the mother function. The proof of Proposition 2.1.2 in |22 | 
shows that for this very transformation model, the linear span of V is dense 
in L^(R^) as long as cj) has nontrivial support, i.e., unless (f)(x, y) = almost 
everywhere. 
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Fig. 2. The parameter vectors corresponding to the projections of the point 
Ui on the previous manifold A4 (pj _ 1 ) and the updated manifold A4 (pj ) are 
shown respectively by and A^. 

solution by minimizing a DC approximation of the objective 
function using DC programming. We then refine our solution 
by using a locally linear approximation of the manifold near 
each input image and minimizing the total tangent distance to 
the manifold with gradient descent. The reason for our choice 
of a two-step optimization in atom selection is the following. 
The DC solver used in our implementation is the cutting 
plane algorithm, which slows down as the number of vertices 
increases throughout the iterations. Therefore, in practice, we 
use the DC programming step for approaching the vicinity of 
a good solution and we terminate it when it slows down. Then, 
we continue the minimization of the function with gradient 
descent. Considering that the DC program is not affected 
by local minima and gradient descent is susceptible to local 
minima, using these two methods respectively for the first and 
second parts is a suitable choice. We start by giving a brief 
discussion of DC functions 1231 that are used in our algorithm. 

Definition 1: A real valued function / defined on a convex 
set C C MMs called DC on C if for all x e C, f can be 
expressed in the form 

fix) = g{x) - h{x) (7) 

where g, h are convex functions on C. The representation ^ 
is said to be a DC decomposition of /. 

An important fact about DC functions is the f olio win 1231 . 

Proposition 1: Every function / : ^ R whose second 
partial derivatives are continuous everywhere is DC. 

The global minimum of DC functions can be computed 
using DC solvers such as the cutting plane algorithm and the 
branch-and-bound algorithm 125], which is a major reason for 
the choice of DC programming in this work. There are also 
some DC optimization methods such as DCA |26| and the 

^Proposition ^ is the original statement of Corollary 4.1 in |23 |, which 
holds for functions defined on R*. However, a function defined on a convex 
subset of W with continuous second partial derivatives is also DC. This can 
be easily seen by referring to the proof of Corollary 4.1 in |23 |, which is 
based on the fact that locally DC functions are DC, and to Hartman's proof 
[24 1 that locally DC functions defined on a convex subset of are DC on 
the same domain. 
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Fig. 3. Si (pj ) is the first order approximation of the manifold A4 (pj ) around 
Ux-{pj). Here, the difference vector Ci between Ui and its exact projection 
on M(pj) is approximated by the difference vector between Ui and its 
projection on Si(pj). 

concave-convex procedure (CCCP) fTf], which have favorable 
computational complexities and converge to a local minimum. 
The theoretical guarantee for finding the global minimum with 
the cutting plane algorithm is lost when the DC program is 
terminated before exact convergence as in our implementation; 
however, the overall two-step minimization gives good results 
in practice. 

Equipped with the DC formalism, we can now describe our 
iterative manifold learning algorithm. As the atom selection 
procedure requires the computation of the distance between 
the input images and the PTM, the algorithm initially needs 
a rough estimate of the parameter vectors. Therefore, we first 
assign a tentative set of parameter vectors {A^} to the images 
{ui} by projecting {ui} onto some reference transformation 
manifold A^(^). The pattern ^ can be possibly chosen as a 
typical pattern in the input set (an L^(R^) -representation of 
some Ui). Then, the parameter vector assigned to an image 
is given by = argmin^GA H'^i — Ux{'^)\\. We compute 
the transformation parameters by first roughly locating the 
projections with the help of a grid, and then performing a 
line search near the closest grid point. 

Now let us describe the j-th iteration of the algorithm. Let 
Pj-i denote the pattern consisting of j — 1 atoms (one can set 
Po=0). In the j-th iteration we would like to choose an atom 
(I) J. G V and a coefficient cj such that the data approximation 
error 

N N 

E = J2\\e,f = J2d'{u,,M{pj)) (8) 

i=l i=l 

is minimized, where pj = Pj-i + We remark that 

the cost function in ([5]) is defined as a function of all atom 
parameters {7j}jLi and coefficients {cj}f^i, however, the 
one in ^ is considered only as a function of and cj. For 
simplicity, we use the same symbol E for these two functions 
with an abuse of notation. 

Notice that the values of {A^} may change between itera- 
tions j — I and j, because the projection points change when 
the manifold is updated. The alteration of {A^} is illustrated 
in Figure |2] At the beginning of the j-th iteration, the vectors 
{Xi} take the values computed at the end of iteration j — I 
by projecting {ui} on M{pj-i). Therefore, d{ui^M{pj)) 
— Ux.{pj)\\ in general. In the minimization of E, it 



is not easy to formulate and compute the exact distance 
d{ui, M{pj)), since it would require the formulation of A^ 
as a function of the optimization variables, which does not 
have a known closed-form expression. Therefore, we propose 
to minimize E in two stages. Let 7 = 7^ and c = Cj denote the 
parameters and the coefficient of the new atom for the ease of 
notation. In the first stage, we define a coarse approximatioij^ 

N N 

i=l i=l .QX 

N 
i=l 

of E, where Vi = Ui — U\.{pj-i) is a constant with respect 
to 7 and c. We have the following proposition. 

Proposition 2: £^ is a DC function of 7 and c. Moreover, 
if a DC decomposition for the components (pixels) of the 
transformed atom Ux{(})^) is known, a DC decomposition of 
E is computable. 

The proof of Proposition [2] is given in Appendix A. Al- 
though finding the DC decomposition of an arbitrary function 
is an open problem, DC decompositions are available for 
important function classes |25|. See, for instance, |3 | for the 
derivation of the DC decompositions of several elementary 
functions, and f25l for operations with known DC decompo- 
sitions. For the rest of our discussion, we assume that a DC 
decomposition of the components of U\.{(})^) is computable. 
We can therefore minimize E using the cutting plane algorithm 
discussed in |25| and |3|. This provides an initial solution for 
the atom that is optimized further in the next stage. 

In the second stage of our method, we approximate E 
by another function E, which is the sum of the squared 
tangent distances of {ui} to the updated manifold M{pj). 
Let Si{pj) denote the first order approximation of M{pj) 
around Ux.{pj), where A^ is still as computed at the end of 
iteration j — I. Then, the distance d{ui^Si{pj)) between Ui 
and Si{pj) is called the tangent distance |28| and it provides 
an approximation for d{ui^M{pj)) (illustrated in Figure [sj. 
Hence, E is given by 

N N 

E = Y.\\e£ = Y.d\u,,Si{p,)). (10) 

The complete derivations of E, Si{pj) and the distance to 
Si{pj) are given in Appendix B, where we use results from 
our previous work |29|. We minimize E over (7,0) using a 
gradient descent algorithm. At the end of this second stage, 
we finally obtain our solution for the atom parameters 7 and 
the coefficient c. 

The new atom is then added to the representative pattern 
such that Pj = pj-i-\-c(j)^. Since pj is updated, we recompute 
the projections of {ui} on the new manif old (p j ) and update 

^The operator Ux is linear, since for two patterns p, r, and a scalar c, 
we have Ax(cp + r)(x,y) = (cp + r)(ax(x,y)) = cp(ax{x,y)) + 
r(ax(x,y)) = cAx(p)(x,y) + Ax(r)(x,y). 
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{Xi} such that they correspond to the new projection points. 
The projections can be recomputed by performing a search in 
a small region around their previous locations. 

We continue the iterative approximation algorithm until the 
change in E becomes insignificant or a predefined sparsity 
constraint is reached. We also finalize the algorithm in case 
an update increases E, which might occur as the atom 
selection is done by minimizing the approximations of E. 
The termination of the algorithm is guaranteed as E is forced 
to be non-increasing throughout the iterations. However, due 
to the complicated structure of the method that uses several 
approximations of E, it is hard to provide a theoretical 
guarantee that the solution p, {Xi}fLi converges, even if that 
has been the case in all experiments. We name this method 
Parameterized Atom Selection (PATS) and summarize it 
in Algorithm [T] The complexity of the algorithm will be 
discussed in Section |V] As a final remark, we discuss the 
accuracy of reformulating of the objective function £^ in ([3]) 
in several stages of the algorithm. Firstly, the error arising 
from approximating ^ with ([5]) asymptotically approaches 
as the number of atoms in the sparse approximation is 
increased, provided that the span of V is dense in L^(R^). 
Then, the gradual minimization of ([5]) via minimizing ([8| 
also introduces an error, which is a common feature of 
greedy algorithms. Next, the deviation of ^ in ([9]) from 
£^ in ([8]) mainly depends on the amount of change in the 
transformation parameters between consecutive iterations. 
Starting the algorithm with a good initialization of parameters 
helps to reduce this error. Moreover, the inaccuracy caused 
by this approximation is partially compensated for in the next 
stage as E accounts for parameter changes. The accuracy 
of this second approximation essentially depends on the 
nonlinearity of the manifold; i.e., E = E if the manifold 
is linear. However, even if the manifold has high curvature 
the approximation E ^ E is accurate if the change in 
the transformation parameters is small between adjacent 
iterations, which is often the case, particularly in the late 
phases of the algorithm. 



C. Experimental Results 

We now present experimental results demonstrating the 
application of PATS in transformation-invariant image ap- 
proximation. We first describe the experimental setup. We 
experiment on a PTM model given by 

M{p) = {Ux{p) : A = {0,U,ty,s,,Sy) G A} C (11) 

where 6 denotes a rotation, tx and ty represent translations in 
X and y directions, and and Sy define an anisotropic scaling 
in X and y directions. Ux{p) is a discretization of Ax{p), where 
Ax{p){x,y) =p{x',y') and 
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(12) 

We choose a dictionary manifold model given by 

V = {5^((/)) : 7 = {^,Tx,Ty,ax,ay) G T} C L\R^) (13) 



Algorithm 1 Parameterized Atom Selection (PATS) 
1: Input: 

U = {ui}fLi'. Set of observations 
2: Initialization: 

3: Determine a tentative set of parameter vectors {Xi} by 
projecting {ui} on the transformation manifold A^(^) of 
a reference pattern ^. 

4: po = 0. 
5: j = 0. 

repeat 

Optimize the parameters 7 and the coefficient c of the 
new atom with DC programming such that the error E 
in ([9]) is minimized. 
9: Further optimize 7 and c with gradient descent by 

minimizing the error E in (10). 
10: Update pj = pj-i -\- c(j)^. 

11: Update parameter vectors {A^} by projecting {ui} onto 

12: until the approximation error E converges or increases 
13: Output: 

p = Pj'. A representative pattern whose transformation 
manifold M. (p) fits the input data U 



where is 3. rotation parameter, and Ty denote translations 
in X and y directions, and ax and ay represent anisotropic 
scalings in x and y directions. The geometric transformation 
between the mother function and an atom is thus given by 

(l)j{x,y) = (l){x',y'), where 
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(14) 



The mother function 



(t){x,y) 



is taken as a Gaussian function. 

2 -{x^W) 



(15) 



In Appendix C, we describe the computation of the DC 
decompositions of Ux{(j)^) and the error E for this setup. 

In the first set of experiments, we test the PATS algorithm 
on two data sets, which consist of handwritten "5" digits 
and face images. The first data set is generated from the 
MNIST handwritten digits database |30| by applying random 
geometric transformations to 30 randomly selected images of 
the "5" digit. The second data set consists of 35 geometrically 
transformed face images of a single subject with facial expres- 
sion variations | 31 1, which is regarded as a source of deviation 
of the data from the manifold. Both data sets are generated by 
applying rotations, anisotropic scalings and translations. 

In the experiments we measure the data approximation error 
of the learned pattern, which is the average squared distance of 
input images to the computed transformation manifold. In the 
plots, the data approximation error is normalized with respect 
to the average squared norm of input images. 

In order to evaluate the performance of the PATS method, 
we compare it with the following baseline approaches. 

• MP on average pattern: We determine a representative 
pattern (average pattern) by picking the untransformed 
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image that is closest to the centroid of all untransformed 
data set images. Then, we obtain progressive approxima- 
tions of the average pattern with Matching Pursuit |1 1. 

• SMP on aligned patterns: We obtain a progressive si- 
multaneous approximation of untransformed images with 
the Simultaneous Matching Pursuit algorithm explained 
in 1321 . SMP selects in each iteration one atom that ap- 
proximates all images simultaneously, but the coefficient 
of the atom is different for each image. We construct a 
pattern gradually by adding the atoms chosen by SMP 
and weighting them with their average coefficients. 

• Locally linear approximation: We compute the locally 
linear approximation error, which is the average distance 
between an image and its projection onto the plane 
passing through its nearest neighbors. We include this 
error, since typical manifold learning algorithms such as 
ISll and O use a linear approximation of the manifold. 

The dictionary used in the first two methods above is 
a redundant sampling of the dictionary manifold in ([13]). 



The results obtained on the digit and face images are given 
respectively in Figures ffland [5] Some images from each data 
set are shown in Figures ]4(a)l and [5(a) The patterns built with 
the proposed method are displayed in Figures [4(b)] and [5(b)| It 
is seen that the common characteristics of the input images are 
well captured in the learned patterns. The data approximation 



errors of the compared methods are plotted in Figures 4(c) 



and |5(c)[ The errors of the PTM-based methods are plotted 
with respect to the number of atoms used in the progressive 
generation of patterns. The results show that the proposed 
method provides a better approximation accuracy than the 
other approaches. The approximation accuracies of MP and 
SMP are better in the face images experiment compared to 
the digits experiment. This can be explained by the fact 
that face images of the same subject have smaller numerical 
variation with respect to handwritten digit images; therefore, 
an average pattern in the data set can approximate the others 
relatively wellj^ One can also observe that the locally linear 
approximation error is significantly high. The local linearity 
assumption fails in these experiments because of the sparse 
sampling of the data (small number of images), whereas 
PTM-based methods are much less affected by such sampling 
conditions. 

In a second experiment, we study the effect of occlusions 
and outliers in PTM building. We experiment on the same dig- 
its data set as before, with a transformation model consisting 
of 2-D translations, where only the parameters tx^ty in ( [TT] ) 
are used. The images are randomly occluded with horizontal 



and vertical stripes as shown in Figure |6(a)| We generate four 
different data sets, where the first one consists of 150 images 
of only the digit "2". We obtain the other data sets by adding 
the first data set outliers consisting of a mixture of "3", "5" 
and "8" digits, where the outlier/inlier ratio is 10%, 20% and 
30%. We test the PATS method using a dictionary generated 
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Fig. 4. Manifold approximation results with handwritten digits ("5" ) 
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(a) Images from the face data set 
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^In an evaluation on the aligned and normalized versions of the input 
images, the average squared distance to the centroid is found as 0.40 for 
the digit images and 0.01 for the face images. 



Fig. 5. Manifold approximation results with face images 




(a) Images from the occluded digits data set 




(b) Learned patterns. From left to right: 0%, 10%, 
20%, 30% outHers. 
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Fig. 6. Manifold approximation results with occluded digit images with 
outliers 



with the inverse multiquadric mother function given by 

^{x,y) = {l^x^ ^y^Y, /i<0. 

We have set /i = —3 in the experiments. The computation of 
the DC decomposition for this mother function is explained in 
Appendix C. The patterns learned with all four data sets are 



shown in Figure 6(b) and the errors are plotted in Figure 6(c) 



The errors obtained with SMP on aligned patterns are also 
given for comparison. It is shown that the proposed method 
can recover a representative "2" digit in spite of the occlusions. 
As the ratio of outliers is augmented, the characteristics of the 
learned pattern gradually diverge from the "2" digit; and the 
approximation error increases as the average deviation of the 
data from the "2" manifold is increased. 

Then, in a third experiment, we search the effect of some 
algorithm settings on the performance of PATS. We experiment 
on a data set from the Extended Yale Face Database B |33 | 
where face images are captured under varying illumination 
conditions. We create a data set of 90 images by applying 
geometric transformations consisting of anisotropic scaling to 
the images of a single subject, where only the parameters 
Sx ^Sy'm{ TTjjare used. Some sample data set images are shown 
in Figure 7(a)[ We apply the PATS algorithm in three different 
settings. In the first setting, the algorithm is used in its normal 
mode; i.e., in line |3] of the algorithm, parameters are initialized 
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(a) Images from the Yale face data set 




(b) Learned patterns. From left to 
right: Normal setting, without gradi- 
ent descent, bad initialization. 
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Fig. 7. Manifold approximation results with face images with varied 
illumination conditions 



with respect to the data set image having the smallest distance 
to the centroid of all images. In the second setting, the 
initialization is done in the same way; however, line |9] of the 
algorithm (gradient descent) is omitted. The third setting is the 
same as the first setting except that the algorithm is started with 
a bad initialization, where the alignment in line [3] is done with 
respect to the data set image having the largest distance to the 
centroid. The patterns learned in all three settings are shown 



in Figure 7(b) and the approximation errors are plotted in 
Figure [7(c) The algorithm does not output clear facial features 
due to the variation of illumination. The gradient descent step 
is seen to bring a certain improvement in the performance. 
The results also show that the algorithm has a sensitivity to 
initialization. A significant change in the initial values of the 
transformation parameters causes the algorithm to compute a 
different solution. In order to provide a comparison, we also 
plot the results obtained with "SMP on aligned patterns" with 
default and bad alignments. The fact that the error difference 
between the two cases is much larger in SMP compared to 
PATS suggests that PATS can nevertheless compensate for a 
bad initialization of transformation parameters to some extent. 

Finally, in a last experiment we examine the approximation 
accuracy of the learned manifold with respect to the noise level 
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A. Problem Formulation 



Noise variance 



X 10" 



Fig. 8. Dependence of the approximation error on the data noise. The largest 
noise variance 2 x 10 ~^ corresponds to an SNR of 9.054 dB. 



of the data set. We form a synthetic pattern r that is composed 
of 10 randomly selected atoms from V. Then, we generate a 
data set U of 50 images by applying to r random geometric 
transformations of the form ([TT]). We derive several data sets 
from U by corrupting its images with additive Gaussian noise, 
where each data set has a different noise variance. Then, we 
run the PATS algorithm on each data set. In Figure [8] the 
data approximation error is plotted with respect to the noise 
variance. The deviation between U and M{r) depends on 
the noise level, and the ideal approximation error is linearly 
proportional to the noise variance. Such a linear dependency 
can be observed in Figure [8] However, one can note that the 
curve does not pass through the origin, which is due to the 
suboptimal greedy nature of the algorithm. 

IV. Joint Computation of PTMs for Classification 

In this section we consider multiple image sets, where each 
set consists of geometrically transformed observations of a 
different signal class. We build on the scheme presented in 



Section III-B and extend the PATS algorithm for joint manifold 
computation in classification applications. 

We remark the following about the use of PTM models in 
transformation-invariant classification. Given a collection of 
PTMs representing different classes, for each manifold one 
can identify a subset of the whole image space that consists 
of points whose distances to that manifold are smaller than 
their distances to the other manifolds. We call this subset of 
the image space as the "approximation region" of the manifold 
(see 1341 , Section II for a more detailed discussion). Note that 
it is not always possible to partition the whole image space 
into the approximation regions of a set of class-representative 
PTMs. For instance, one may come across degeneracies re- 
sulting from manifold intersections; or there may exist a full- 
dimensional subset of the image space that is equidistant 
to two manifolds. Yet, our PTM computing approach relies 
on the implicit assumption that in a transformation-invariant 
classification application, the training and test signals that 
belong to a certain class are in practice likely to be located 
close to the approximation region of a PTM. 



Consider a collection of visual signals U = IJm=i^^ 
consisting of M classes, where each subset 



C 



{u^}-^ consists of Nm geometrically transformed observa- 
tions of a visual signal of class m. We would like to represent 
each set by a transformation manifold M{p^) that is 
generated by the geometric transformations of a representative 
pattern p'^. Let us denote 



M"^ = M(p^) = {Uxip"^), A G A} c 



(16) 



We would like to build {M^} such that they provide a good 
representation of the images in U and also permit to classify 
them accurately by manifold distance computation. Hence, in 
the construction of the manifolds, we formulate the objective 
function as a weighted combination of two terms Ea and Ec, 
which respectively represent approximation and classification 
errors. The approximation error Ea is given by the sum of the 
squared distances of images to the manifold of the same class 



M Nrr 



M Nrr 



=1 i=i 



We assume that an image is assigned the class label of the 
manifold with smallest distance to it. We define a misclassifi- 
cation indicator function / such that for uT' G 



^ ^ ^ 1 1, otherwise. 



(18) 

Then, the classification error Ec is the total number of mis- 
classified data points. 



M Nrr 



(19) 



m=l i=l 



We would like to compute {M'^}^^i such that the weighted 
error 

E = Ea^aEc (20) 

is minimized, where a > is a coefficient adjusting the 
weight between the approximation and classification terms. 
We formulate a generic PTM learning problem as follows. 

Problem 3: Given image sets {U^}, compute patterns 
l^mj ^ L^(R^) and transformation parameters {A^} C A, 
m = 1, ... , M and z = 1, . . . , Nm, by minimizing 



M Nrrr 

^ = EE(IK-^^r(^'" 

m=l i=l 



a 



HO) . 



(21) 



Our solution is based on constructing each p'^ using 
atoms from the dictionary manifold V defined in (|4]). We 
reformulate Problem [3] under these assumptions. 

Problem 4: Given image sets {W^}, a mother function (j) 
and sparsity constraints {Km}', compute a set of atom param- 
eters {jj^} C F, coefficients {cj^} C M, and transformation 
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parameters {A^} C A for m = 1, . . . , M, j = 1, . . . , 
and i = 1, ... , Nm, by minimizing 

M Nrr^ Km, 

m=l i=l ji = l 



■ a/(^/r))- (22) 



5. Classification-Driven PTM Learning 

Problem [4] is similar to Problem [2j except that it also 
involves a classification error term that has a quite complex 
dependence on the optimization variables. Therefore, it is hard 
to solve optimally. We present a constructive solution based 
on building {p^} iteratively with joint atom selection. 

We begin with a tentative assignment of parameter vectors. 
In (22) each vector corresponds to the projection of 
on A^^. We assign {A^} by picking a reference pattern 
for each class and then projecting each onto M{^^). We 
also compute the cross-projection vectors {A^'^}, where 



\m,r • II n 

A- ' = argmm \ \u- 
xeA 



UxiplW 



corresponds to the projection of u'^ onto A4^. 

Then, we construct {p^} by gradually adding new atoms 
to each p^. In the j-th iteration of the algorithm, we would 
like to optimize the parameters jj^ and coefficients cj^ of 
the new atoms such that the weighted error E is minimized. 
Now we consider the j-th iteration and denote 7^ = 7^^, 
^ jYicn 7 = [7I 7^ . . . 7^] and c = [c^ . . . c^] 
are the optimization variables of the j-th iteration. We consider 



E SiS 3. function of 7 and c similarly to Section III and 
propose to minimize E through a two-stage optimization. We 
first obtain an approximation E of E, which is in a DC 
form. We minimize E using the cutting plane algorithm and 
estimate a coarse solution, which is used as an initial solution 
in the second stage. Then in the second stage, we define a 
refined approximation E of E based on the tangent distances 
of images to the manifolds and minimize it with a gradient- 
descent algorithm. 

The minimization of E and E determines a solution for 7 
and c. We update the pattern of each class by adding it 
the selected atom with parameters 7^ and coefficient (in 
practice, we add an atom only if its coefficient is significant 
enough). Then, we recompute the transformation parameters 
{A^} and {A^'^} by projecting the images onto the new 
manifolds. We have observed that selecting the atoms by 
minimizing a combination of approximation and classification 
terms instead of only a classification term gives better results, 
especially for robustness to data noise. Still, we would like to 
make sure that the selected atoms improve the classification 
performance at the end of an iteration. Therefore, the decision 
of accepting the updates on the manifolds is taken according 
to the classification error Ec in ([19]). If Ec is not reduced we 
reject the updates and pass to the next iteration]^ We continue 
the iterations until the classification error Ec converges. The 

^In the course of the algorithm, parameters (3 and a are adapted such that 
the emphasis is shifted from approximation capabilities in early phases to 
classifi cation capabilities in later phases. This is explained in more detail in 
Section [TV-C| For this reason, even if the classification error does not decrease 
in one iteration, it may do in the next one. 



termination of the algorithm is guaranteed by constraining Ec 
to be non-increasing during the iterations, which in return 
stabilizes the objective function E. We call this method Joint 
Parameterized Atom Selection (JPATS) and summarize it in 
Algorithm |2] 

Let us come to the detailed description of the approxima- 
tions of E in the two-stage optimization. Firstly, let {pJLi} 
and {M'j^_i} denote the patterns and the corresponding trans- 
formation manifolds computed after j — I iterations. For 
simplicity of notation, we will use the convention = MJ^ 
and p^ = pj^ throughout the derivations of E and E. 

In the first step, we obtain E in the form E = Ea -\- a Ec, 
where Ea and Ec are respectively the approximations of Ea 
and Ec. The first term Ea is simply given by the generalization 
of the approximation error in ([9]) to the multiple manifold case. 



M Nrr 



-mil 2 



M Nm 

EE IK 

m=l i=l 



^» = EEi 

m=l i=l ^ „ ^ 

(23) 

where the parameters A^ are the ones computed at the end of 
iteration (j - 1), and vf" = uf - Uxy^{pJIi)- 

Then, we derive Ec in the following way. Notice that the 
classification error Ec in ([19]) is a discontinuous function of 7 
and c due to the discontinuity of the misclassification indicator 
function /. Let r(ii^) denote the index of the manifold with 
smallest distance to an image u^^ among the manifolds of all 
classes except its own class m; i.e.. 



arg min d{u'^,M'^). 



It is clear that r{u^) can take different values throughout the 
iterations. However, for simplicity, in the j-th iteration we fix 
the indices r{uf')io their values attained at the end of iteration 
(j — 1) and denote them by the constants . Then we can 
define the function 

such that I{u^) corresponds to the unit step function of 
f{u^)\ i.e., I{u^) = u{f{u^)). Thus, if we replace the unit 
step function with the sigmoid function S{x) = (l + e~^^)~^, 
which is a common analytical approximation of the unit step, 
we obtain the approximation S(^f{u^)) = (1 + e~^^^^'^^)~^ 
of I{u^). As the value of the positive scalar P tends to infinity, 
the sigmoid function approaches the unit step function. A 
continuous approximation of Ec is thus given by 



M Nm 



171=1 



(24) 



Now, in order to minimize the function in ([24]) we do the 
following. We first compute 

foiuT) = d\uT,Mf_,) - d\uT,Mf_,) 

for each image u^. Then, applying a first-order expansion of 
5* around each fo{uY^), we obtain the following approximation 
of the error term in ( [24] ). 

M Nrr 



m=l 



EE^(/o(-r)) 



dS 
df 



/=/o(«r) 



{f{uT)-foiuT)))(25) 
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Since /o(ii^) and S{fo{u^)) are constants, the minimization 
of the expression in ( [25] ) becomes equivalent to the minimiza- 
tion of 



EE" 



m=l i=l 

where 



df 



M N, 



l^l^vTfiO (26) 

m=l i=l 



(l + e-/^/)2 



/=/o(nr) 



Let us rearrange ( 26 ) in a more convenient form. For each 
class index m, let = : = m} consist of the 

pairs of data and class indices of images that do not belong 
to class m but have as their closest manifold among all 



manifolds except the one of their own class. Then (26) can be 
rewritten as 



M Nrr 



M 



m=l i—1 



E 

1 ii,k)eR'" 



(27) 



As it is not easy to compute the distance terms d'^{u^^M^) 
directly, we proceed with the approximation d'^{u^^M^) ~ 
\\u^ - U^k,rr.{pf_-^ + <?^7-)|P, where the value of is 
the one computed in iteration (j — 1). We finally get Ec from 
( [27] ) with this approximation. 



M Nm 

= EE 

m=l i=l 
M 

E E 

m=l {i,k)eR'^ 



' U\^{(p^ra)\Y 



(28) 



k II k,m 



where '™ = u\ 
we can define 



i). Now, from d23b and (28) 



E = Ea + aEe. 



(29) 



Proposition 3: E is 3. DC function of 7 and c. Moreover, 
if a DC decomposition for the components of the transformed 
atom Ux{(j)^) is known, a DC decomposition of E is 
computable. 

The proof of Proposition [3] is given in Appendix D. 

Now let us describe the term E that is used in the second 
stage of the optimization of E. We derive E by replacing 
the manifold distances by tangent distances; i.e., we use 
the approximation d^{u^,M^) ^ d^ {u^ , {p^)) , where 
S^ip^) is the first-order approximation of around the 
point U^k^m {p^). The tangent distance is derived in Appendix 

B. Let = - ^A-(p'^) and ^f'"^ = u^^ - U^k,m{p'^). 
Then the function Ea in ( [Tt] ) is approximated by 



Ea 



M Nrr, 

EE 

m=l i=l 



(30) 



Similarly, the classification error function in (27) is approxi 
mated by 

= EE^riK-^^((^^)^^^)"' {Trv<f 

m=l i=l 

-E E U 



(31) 



Ik. 



Here T-^ and T- denote the n x d matrices whose columns 
are the tangent vectors to the manifold dX respectively 
the points Uxrn{p'^) and U^k,m{p^). From (30) and (jsTj) we 
can finally define 

E = Ea^aEa- (32) 

Let us briefly discuss the effect of the approximations made 
on the original cost function E. The accuracy of approximating 



the unit step function with a sigmoid in ( [24| ) can be adjusted 
by changing the slope of the sigmoid (see also the note in 



Section IV-C ). Then, in order for the linear approximation of 
the sigmoid in (25) to be valid, the values of f{u^) must 
be sufficiently close to their base values fo{uY^). The effect 
of this linearization can be alleviated by updating the base 
values fo{u'^) several times in an iteration. The rest of the 



approximations are similar to those discussed in Section IH-B 



Algorithm 2 Joint Parameterized Atom Selection (JPATS) 



Input: 

U = Um=i observations for M signal classes 

Initialization: 

Determine tentative parameter vectors {A^'^} by project- 
ing {u'^} on the transformation manifolds {A^(^^)} of 
reference patterns {^^}. 

= for m = 1, ... ,M. 
J=0. 

Initialize the sigmoid parameter /3 and the weight param- 
eter a. 
repeat 



Optimize the joint atom parameters 7 = [7^7^ 



and coefficients c 



7 



Ml 



with DC program- 



10: 

11: 

12: 
13: 
14: 



15: 
16: 



ming such that the error £^ in ( 29 ) is minimized. 
Further optimize 7 and c with gradient descent such 
that the refined error E in (32) is minimized. 
Update pj" = pjl^^c"^ ^^.^ for m = 1, . . . , M if 
is significant. 

Update the parameter vectors {A^'^}. 
Update (3 and a. 

Check if the new manifolds reduce the classification 
error Ec. If not, reject the updates on p^ and {A^'^}, 
and go back to [9] 

until the classification error Ec converges 

Output: 

l^mj _ ^p'rnj- A sct of pattcms whose transformation 
manifolds {M^} represent the data classes 
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C. Implementation Details 

We now discuss some points related to the implementation 
of JPATS. We first explain the choice of the parameter /3 
in Algorithm p] Notice that the function S[f{u'^)) can also 
be interpreted as the probability of misclassifying u'^ upon 
updating the manifolds at the end of the iteration. When u^^ 
gets closer to its true manifold M^, fi^T) decreases and 
S(^f{u^)) decays to 0. Similarly, when gets away from 
A^^, SQ{u'^)) approaches 1. The probabilistic interpretation 
of the function S[f{u'^)) stems naturally from its shape. 
Consequently, the approximate error in ([24]) corresponds to 
the sum of the probabilities of misclassifying the input images. 
Based on this interpretation, we propose to update /3 according 
to the statistics drawn from the data. For each u^, we examine 
the value of /(ix^) at the beginning the iteration and the value 
of I{u^) at the end of the iteration. Then we pick j3 such 
that the shape of the sigmoid matches the I{u^) vs. f{uf') 
plot. Such an adaptive choice of j3 also provides the following 
flexibility. In early phases of the process where the total 
misclassification rate is relatively high, /3 usually has small 
values, which yields slowly changing sigmoids. Therefore, a 
relatively large portion of the input images have an effect on 
the choice of the new atoms. However, in later phases, as the 
total misclassification rate decreases, P usually takes larger 
values resulting in sharper sigmoids, which gives misclassified 
images more weight in atom selection. 

Then, we comment on the choice of the weight param- 
eter a. In principle, a can be set to have any nonnegative 
value. Setting a = corresponds to a purely approximation- 
based procedure that computes the manifolds individually 
with PATS, whereas a large a yields a learning algorithm 
that is rather driven by classification objectives. However, 
we have observed that a good choice in practice consists of 
selecting a small value for a at the beginning and increasing 
it gradually]^ This guides the algorithm to first capture the 
main characteristics of input signals, and then encourage the 
selection of features that ensure better class-separability. 

Finally, we have made the following simplification in the 
implementation of the DC programming block. The number of 
optimization variables is (s + 1)M in our problem, where s is 
the dimension of V and M is the number of classes. Although 
the cutting plane algorithm works well for low-dimensional 
solution spaces, it becomes computationally very costly in high 
dimensions. Therefore, in the implementation of JPATS we 
partition the variables into subsets and optimize the subsets 
one by one. Although there is no guarantee of finding the 
globally optimal solution in this case, we have experimentally 
observed that one can still obtain reasonably good results 
regarding the complexity-accuracy tradeoff. In order to handle 
high-dimensional solution spaces, one can alternatively replace 
the cutting plane algorithm with another DC solver such as 
DC A 1 26 1 or CCCP 123. These methods reduce the original 
DC program to the iterative solution of a pair of dual convex 
programs, which improves the computational complexity sig- 

^In our setup, we control the a parameter by using a shifted and scaled 
sigmoid function. The initial and final values of the sigmoid are around 0.5 
and 10; and its center is typically attained at iterations 5-7 of Algorithm |2] 



nificantly at the expense of losing global optimality guarantees. 
Another issue affecting the efficiency of the DC programming 
block is the size of the solution space. We have seen that it is 
useful to add a preliminary block that locates a good search 
region before the DC block. This can be achieved using a 
coarse grid in the solution space or a global search method 
such as the genetic algorithm or particle swarm optimization. 
Note that one may also minimize the objective function by 
using only a global search method. However, in experiments 
we have seen that the final value of the objective function is 
the smallest when both global search and DC optimization are 
employed. 

D. Experimental Results 

We now evaluate the performance of JPATS with experi- 
ments on transformation-invariant classification. We test the 
algorithm on two data sets consisting of handwritten digits 
1 30 1 and microbiological images |35|. In the digits experiment, 
we use the transformation manifold model in ( pT| ). In the 
microbiological images experiment, we use the model 



M{p) = {Ux{p) : A = {0,t,,ty,s) G A} C 



(33) 



where s denotes an isotropic scale change. In both experi- 



ments, we use the dictionary model in (13) and the Gaussian 
mother function in ([15]). 

The first experiment is conducted on the images of the 
"2,3,5,8,9" digits, which lead to a relatively higher mis- 
classification rate than the rest of the digits. The data sets 
are generated by randomly selecting 200 training and 200 
test images for each digit and applying random geometric 
transformations consisting of rotation, anisotropic scaling and 
translation. The images of each digit are considered as the 
observations of a different signal class. 

The second experiment is done on some sequences from the 
microbiology video collection of the Natural History Museum 
1 35 1, which contains short video clips of living protists. We 
run the experiment on 6 different species (Discocephalus sp., 
Epiclintes ambiguus, Oxytricha sp., Scyphidia sp., Stentor 
roeseli, Stylonychia sp.), and we use three sample videos for 
each one. Each species is considered as a different class. The 
manifold in ( [33] ) provides a suitable model, as the rotation and 
translations describe well the movements of the protists, and 
the isotropic scaling compensates for zoom changes. However, 
there is still some deviation from the manifold, as a result 
of noise, small nonrigid protist articulations and occasional 
recording of different individuals in different videos. For 
each species, we experiment on a subset of frames from all 
three sequences. We preprocess the frames by conversion to 
grey scale, smoothing and thresholding. Then, for each class, 
we randomly select 70 training and 35 test images. 

In the experiments we compare the methods listed below. 
In the first four methods, we apply the algorithms on the 
training images in order to build PTMs. Then we compute 
the misclassification rate of the test images. The class label 
of a test image is estimated by identifying the smallest 
distance between the image and the computed manifolds. The 
algorithms work as follows. 
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Fig. 10. Performance of the classification-driven learning algorithms on the 
microbiological images data set 



Fig. 9. Performance of the classification-driven learning algorithms on the 
handwritten digits data set 



JPATS: We jointly build PTMs for all classes with the 
proposed method. 

PATS: We compute individual PTMs for each class with 
PATS. 

SMP on aligned patterns: We compute individual PTMs 



for each class as explained in Section |III-C 
SAS on aligned patterns: We use the untrans- 
formed/aligned images of all classes and select a set of 
Gaussian atoms with SAS 1 16|. We set the weight factor 
to A = 2 in |[T6ll . Then, for each class we build a PTM by 
forming a pattern, where the selected atoms are weighted 



with their average coefficients. 

LLA: We compute a locally linear approximation using 
the training images of each class. A test image is classi- 
fied by identifying its {d + 1) -nearest neighbors among 
the training images of each class, computing its distance 
to the plane passing through the nearest neighbors, and 
comparing its distances to the planes of different classes. 
SLLE: We compute a low-dimensional embedding of 
the training images with the Supervised Locally Linear 
Embedding algorithm 1 10| and assign the class labels of 
the test images via nearest-neighbor classification in the 
embedded domain. 

LDA: Linear Discriminant Analysis on aligned data. The 
better one of linear and quadratic kernels is picked in 
each experiment. 



14 



• Neural Network: A feed-forward backpropagation net- 
work for pattern recognition is used on aligned data. 

The results are presented in Figures |9] and [TO|respectively for 
the digit and microbiological image experiments. In Figures 



9(a)| and |lQ(a)[ a data set image from each class is shown. 
Some typical representative patterns computed with JPATS, 



PATS and the reference methods are shown in Figures 9(b) 



9(e)|and p^(b)pQ(e)l Figures |9(0l and \TO(f)\ show the misclas- 
sification rates of test images (in percentage) vs. the number 
of atoms per class. Both plots are obtained by averaging 
the results of 5 repetitions of the experiment with different 
training and test sets. The results show that JPATS yields 
the best classification performance in general. Figures |9] and 
10 suggest that JPATS has better classification performance 
although PATS produces visually more pleasant patterns. This 
can be explained as follows. PATS is designed to minimize 
the approximation error; and the assessment of the visual 
quality of the computed patterns is rather dependent on their 
approximation capabilities. The local features that are common 
to different classes appear in the representative patterns of 
all these classes built with PATS, which produces an output 
that matches visual perception. However, if a local feature 
is common to several classes, its inclusion in the represen- 
tative patterns does not contribute much to the discrimination 
among classes; therefore, these non-distinctive features are not 
emphasized in the output of JPATS. On the other hand, the 
local features that are rather special to one class are more 
pronounced in JPATS compared to PATS. In fact, due to the 
classification error term in JPATS, the algorithm tends to select 
atoms that "push" a manifold away from the samples of other 
classes!^ 

Next, we examine the effect of data noise on the perfor- 
mance of JPATS. We create several data sets by corrupting the 
digits data set used in the previous experiment with additive 
Gaussian noise of different variances. For each noise level, we 
look into two cases, where only training images are corrupted 
in the first one, and both training and test images are corrupted 
in the second one. The misclassification rate of test images 
are plotted in Fig ure |l ll where (^'^rpain 

and cr'^Qst denote the 
noise variances of training and test images. The data noise 
has a small influence on the performance of the algorithm. 
The final increase in the misclassification rate is bounded by 
2.7% even when the noise energy reaches 23% of the signal 
energy. The robustness to noise is achieved due to the fact 
that the algorithm is designed to generate a smooth pattern 
that fits all images simultaneously, which enables it to smooth 
data noise. The other PTM-based methods are also expected 
to exhibit similar noise behaviors. 

Finally, we evaluate the performances of PATS and JPATS 
in a setting where the test images contain some outliers that 
do not belong to any of the classes. We run the experiment 
on the same digit data set that has been used in the previous 



^^For instance, in Figure 9(b) the top and bottom "arcs" of the "8" digit 
are not as apparent, since the other digits also have similar features (all other 
digits have the top part; and "2", "3" and "5" have a bottom part). However, 
the crossover of "8" is specific to this class; therefore, it is prominent in the 
output. Similarly, the straight edge of "9" is also characteristic of this class 
and emphasized in the learned pattern. 



experiment of Figure |9] The training phase of the algorithms 
is as before: In both methods, the manifolds are learned using 
only training images of known classes. However, test images 
are contaminated with 200 outlier images that do not belong 
to any of the target classes, where the number of test images 
in each class is also 200. Each outlier image is generated by 
randomly selecting one test image from each class, taking the 
average of these images, corrupting the average image with 
additive Gaussian noise, and finally normalizing it. Thus, all 
outlier images have unit norm, while a typical class- sample test 
image with unit scale {Sx = l^Sy = 1) also has unit norm. 
Then, test images are classified using the manifolds learned 
with PATS and JPATS as follows. If the distance between a 
test image and the closest manifold is larger than a threshold, 
the image is labeled as an "outlier"; and if this distance is 
smaller than the threshold, it is assigned the class label of 
the closest manifold as before. The experiment is repeated 
for different values of the noise variance for the Gaussian 
noise component of the outlier images. The threshold used 
for each noise level is numerically selected in a sample run 
of the experiment such that it gives the best classification 
rate and fixed for the other runs, separately for PATS and 



JPATS. The results are presented in Figure 12 which are the 
average of 5 runs. The misclassification rate is the percentage 
of test images that have not been assigned the correct class 
label or the correct "outlier" label. In the plots shown in 



Figure [12] the noise variance corresponds to the case 
that outlier images are the averages of some test images 
coming from different classes. This is the most challenging 
instance of the experiment, as outliers come from a region 
close to class samples and it is thus difficult to distinguish 
them from class samples. Consequently, the optimal threshold 
that gives the smallest misclassification rate is high for this 
instance, resulting in labeling all outliers as class samples. The 
performance of JPATS is better than PATS in this case, as the 
overall classification rate is determined by the classification 
rate of class samples. Then, as the noise variance is increased, 
the components of the outlier images in random directions in 
the image space get more dominant, making it thus easier to 
distinguish them from class samples. It is seen that JPATS 
performs better than PATS in most cases. However, for the 
smallest nonzero noise variance value = 0.25 x 10~^, PATS 
is observed to give a better classification rate than JPATS. 
This can be explained as follows. In this experiment, JPATS 
is trained according to the hypothesis that all test images 
belong to a valid class. For this reason, in order to increase 
the distance between a manifold and the samples of other 
classes, JPATS may occasionally pick some atoms that push 
a manifold away from the samples of its own class as well. 
This slight increase in the distance between the manifold and 
the samples of its own class renders it difficult for JPATS to 
distinguish between real class samples and challenging outliers 
that are very close to class samples. In order to get the best 
performance from JPATS in such a setting with outliers, one 
can tune the a parameter to a suitable value depending on 
outlier characteristics such that a sufficiently strict control is 
imposed on the distance between the learned manifolds and 
the samples of their own classes through the approximation 
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Fig. 11. Performance of JPATS on noisy data. The noise variance cr-^ = 1.6 
corresponds to an SNR of 6.35 dB. 




- JPATS 
- PATS 



1.5 2 2.5 
Noise variance 



3.5 4 

X 10"^ 



Fig. 12. Performance of PATS and JPATS in a classification setting with 
outlier test images that do not belong to any class. For the noise variance 
cr2 — Q 5 X 10~^, the ratio between the norms of the noise component and 
the average component of an outlier image is 0.65. 
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V. Complexity Analysis 



Let us now examine the complexities of the proposed 
algorithms. We begin with the PATS method summarized in 
Algorithm [T] There are three blocks in the main loop of the 
algorithm. The first one minimizes E with DC programming, 
the second one minimizes E via gradient descent, and the 
third one computes the projections of {ui} on the manifold. 
In the analysis of the first block, it is important to distinguish 
between the complexities of the DC solver and the compu- 
tation of the DC decomposition of E. The former depends 
on the selected solver. The cutting plane method involves the 
construction of polytopes in the search space; therefore, it has 
an exponential complexity in the number of atom parameters s 
(dimension of V). One iteration of the cutting plane algorithm 
typically takes a few seconds in a MATLAB implementation, 
and around 40-60 iterations are done for atom selection. 



However, one can also use a technique such as DC A \7E\. In 
this case, the solution of the dual convex programs involves the 
evaluation of the subdifferentials of the functions constituting 
the DC decomposition. In our problem, this corresponds to 
the evaluation of gradients since the decomposing functions 
are differentiable. The gradient can be numerically evaluated 
using finite differences; therefore, the complexity of such a 
solver is at most linear in s. Next, it can be seen from ([34]) 
and (35) that the cost of computing the DC decomposition 
of E is linear in the image resolution n and the number of 
samples A^. Hence, the complexity of the first block becomes 
0{2^nN) for cutting plane, and 0{snN) for DCA. In the 
analysis of the second block, it can be easily shown that the 
complexity of calculating the vector Wi — Ti{Tj Ti)~^ Tjwi in 
(36) is 0((in^), where d is the dimension of M{p). Therefore, 



the cost of computing the total squared tangent distance E 
in (36) is obtained as 0{din? N). As we minimize E with 
gradient descent using finite differences, the complexity of 
the second block is 0{sdii? N). Finally, the cost of updating 
the projections of {ui} on M{p) is 0{dnN) in our actual 
implementation, because we minimize the distance between 
each {ui} and M{p) by performing a line search along each 
dimension of A4{p). Thus, taking DCA as a reference for the 
first block, we can summarize the overall complexity of PATS 
as 0{sdn^ N). 



We examine the complexity of JPATS given in Algorithm 
|2] similarly. From ( [37] ) it is seen that the cost of computing 
the DC decomposition of E is linear in Nj and n, where 
= Ef=i Nm = The complexity of the first 

block with respect to DCA is therefore 0{sn Nj). Then, ( [30] ) 
and ( [31] ) show that the cost of computing E is 0{dn^ Nj). 
The complexity of the second block is thus 0{sdn^ Nj). 
Finally, the third block has complexity 0{dn Nj M), since 
each image is reprojected on each manifold. Therefore, the 
overall complexity of JPATS is 0{Nj{sdn^ + dnM)). 
The complexity of selecting an atom with "SMP on aligned 
patterns", which has the closest performance to the proposed 
methods, can be similarly obtained as 0{Nj n D), where 
D denotes the cardinality of the discrete dictionary used. 
We remark that the proposed method is more suitable for 
applications where the manifolds are learned "offline" and 
then used for the classification of test data. Moreover, there 
might be ways to improve the complexity-accuracy tradeoff 
depending on the application. For instance, one might prefer 
to sacrifice on accuracy for a less complex solution by omitting 
step [9] or [To] of Algorithm [2] Also, if the class-representative 
manifolds are well- separated, it may be sufficient to use the 
PATS algorithm instead of JPATS. An option for achieving a 
high-speed PTM learning is to build a tentative representative 
pattern, for instance with "SMP on aligned patterns", in a 
preliminary analysis step and register the input images with 
respect to this pattern. Then, one may speed up the learning 
significantly by discarding the projection update steps and op- 
timizing the atoms of the representative pattern by minimizing 
only the error in ^ with a fast minimizer such as the gradient 
descent algorithm. 



16 



VI. Conclusion 

We have studied the problem of building smooth pattern 
transformation manifolds for the transformation-invariant rep- 
resentation of sets of visual signals. The manifold learning 
problem is cast as the construction of a representative pattern 
as a linear combination of smooth parametric atoms. The 
manifold is then created by geometric transformations of 
this pattern. The smoothness of the computed manifolds is 
ensured by the smoothness of the constituting parametric 
atoms. We have described a single manifold learning algo- 
rithm for approximation and a multiple manifold learning 
algorithm for classification. Experimental results show that 
the proposed methods provide a good approximation and 
classification accuracy compared to reference methods. A 
future direction to explore is the amelioration of the sensitivity 
of the methods to the initialization of projection parameters. 
The presented methods are applicable to unregistered data 
that can be approximated by 2-D pattern transformations 
with a known transformation model. Our study can find sev- 
eral applications in the transformation-invariant representation, 
registration, coding and classification of images. 
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Appendix 
A. Proof of Proposition |2] 

Before showing the DC property of the objective function, 
we Hst below some useful properties of DC functions from 
L23J and L3J. 
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Proposition 4: (a) Let be a set of DC functions 

with decompositions fi=gi — hi and let {A^}^^ C M. 
Then Yl^i ^ifi the following DC decomposition 

El, El. 



i=l 




(b) Let fi=gi — hi and f2 = 92 — ^2 be DC functions with 
nonnegative convex parts ^i, ^2, ^2- Then the product 
/i /2 has the DC decomposition 

fif2 = U{gi^g2)'^{hi^h2)') 



((^l+/^2)' + (^2+/^l)'), 



which has nonnegative convex parts [ 23ll , El- 

Now we can give a proof of Proposition [2] 

Proof: Let e = v — cUx{(j)^) denote the difference vector 
between an image v and an atom transformed by A with 
a coefficient c. We first show that the components (pixels) of 
U\{(j)^) are DC functions of 7 and c. Remember that U\{(j)^) 
has been defined as a discretization of Ax{(j)^) = ^^(^^ ((/>)). 
We can write 

where all coordinate variables are related by 

(x, y) = b^{x\ y') = (h^ o ax){x, y). 
Then, we get 

Ax{B^{(t))){x,y) = (l){h^{ax{x,y))). 

The l-\h component Ux{(l)j){l) of Ux{(l)j) corresponds to a 
certain point with coordinates (x, y) such that 

C/a('^7)(0 =^A(B7W)(a^.y) = Hby{ax{x,y))). 

Here, 6 is a smooth function of 7, and (/) is also a smooth 
function. Therefore, being a composition of two smooth func- 
tions, [/a(07)(O is smooth as well (|36J, Corollary 7.2), and 
thus DC by Proposition [T] 

In the following, we show the DC property of E. We 
describe at the same time a procedure to compute the DC 
decomposition of £^ if a DC decomposition of Ux{(l)-f){l) is 
available. We expand ||e|p in terms of the errors at individual 
pixels as 



Help 



\\v-cUx{4>,)f 

{v\l)-2v{l)cUx{<P^){l) 



E 

1 = 1 



where is the /-th component of v. The term is constant 
with respect to 7 and c. Using the DC decomposition of 
[/a((/>7)(0 and decomposing c as c = 0.5 (c+1)^ — 0.5 (c^ + 1), 
we can compute the DC decomposition of the second term 

—2v{l)cUx{(l)-f){l) from Propositions 



b and 



_ _ One can 

also obtain the decomposition of the last term & U'^{(j)^){l) 
by applying Proposition 4]b Finally, the DC decompositions 
of Help and 



E 



N 

El 



(35) 



simply follow from Proposition |4|a| 

■ 

B. Derivation of total squared tangent distance E 

The first order approximation of the manifold M{pj) 
around the point Uxi{Pj) is given by 1291 

M{pj) ^ S,{pj) = {UxM)^Tiz: z e R^^^} 

where is an n x d matrix consisting of tangent vectors. The 
k-th column of Ti is the tangent vector d/dk Uxi{Pj), which 
is the derivative of the manifold point Ux^iPj) with respect 
to the k-th transformation parameter Xi{k). The orthogonal 
projection of Ui on Si{pj) is given by Ui = Uxi{Pj) -^TiZ*, 
where the coefficient vector of the projection is = 
(Tj Ti)-^ Tj [ui - UxXPj))- Hence, the difference vector ei 
between Ui and Ui is 

Ci — Ui Hi 

Letting Wi = Ui — Ux^iPj), we get E as 

N 



^ = E 



\Wi 



(36) 



C. Computation of the DC Decompositions in Section III-C 



(34) 



In the derivation of the DC decompositions of Ux{(l)'y){l) 
and E, we build on the results from El^ where a DC 
decomposition of the distance between a query pattern and 
the 4-dimensional transformation manifold of a reference 
pattern is derived. We first give the following results from 

m 

Proposition 5: (a) Let / : ^ R be a DC function with 
decomposition f{x) = g{x) — h{x) and g : R ^ R 
be a convex function. Then q{f{x)) is DC and has the 
decomposition |3 | 

q{f{x))=p{x)-K[g{x) + h{x)], 

where p{x) = q{f{x)) + K[g{x) + h{x)] is a convex 
function and i^T is a constant satisfying K > \q\f{x))\. 

(b) Let e [0,27r) and a G R+. Then the following 
functions have DC decompositions with nonnegative 
convex parts: cos(?/^), sin(?/^), S^^ffi^ sin{ip) ^ p | 



18 



computation details. 

The relation between the transformed atom and the mother 
Gaussian function is given by the change of variables 



Ax{(l)^){x,y) = (l){x,y) 



Let the l-th pixel of Ux{(j)^) correspond to the coordinates 
{x^y) in Ax{(j)^) and the coordinates {x^y) in (j). From (12) 
and ( 14), (x, y) can be derived in the form 



cos(?/^) ^ ^sin(?/^) co^{iIj)tx sin(?/^)r^ 

^ X ^ X ^ X ^ X 

^ _ ^cos(VO _ ^sin(VO _ cos(V;)r^ ^ sin(?/;)ra; 



where 



cos(^)x + shi{0)y — cos{0)tx — sm{0)ty 

^x 

— sin(^)x + cos{6)y + sm{6)tx — cos{6)ty 



Here v and ^ are functions of the transformation parameters A 
and the coordinates (x, y) but they stay constant with respect 
to the atom parameters 7. Now we explain how the coordinate 
variables x and y can be expanded in the DC form in terms 
of the atom parameter variables. 

Firstly, from Proposition ] 5 |b[ notice that the decompositions 
of the functions | ^^^i^ sin\ip) cos{ip) sin{ip) | com- 

^ ^ (Tx ^ ay ^ ay J 

puted as explained in |3|. The first two terms in x and y are 
given by the product of one of these functions with a constant 
term (u or 0. Therefore, we can get the decompositions of 
these terms using Proposition |4]a 

Then, observe that Tx has a decomposition as Tx = 
0.5 {tx + 1)^ - 0.5 (rj + 1) |3 | and the decomposi- 
tion of Ty is obtained in the same manner. Thus, one 
can obtain the DC decompositions of the last two terms 
^cosMr^^ cosgK ^ sinMi^^ ~ ~ applying 

the product property in Proposition |4|b| on the decompositions 
of the terms in {£^^, £iEM, £^Mt' and {r^.r,}. 

Hence, having computed the decompositions of all additive 
terms in x and y, one can obtain the decompositions of x and 
y using Proposition |4|a[ 

Let z = x^ ^y'^. The decompositions of x'^ and y'^ can be 
obtained by applying the product property in Proposition |4]b 
on the decompositions of x and y. Then, the decomposition of 
z follows from Proposition 4ja Expressing the mother function 



as a convex function of z. Proposition |5|a| provides the 
decomposition of (j){x^y). Thus, we obtain the decomposition 

of Ux{c^^){l). 

After this point, the decomposition of E can be computed 
based on the description in Appendix B. However, notice 
that for this special case of Gaussian mother function, the 
decomposition of the term Ul{(j)^){l) in (34) can also be 
obtained by writing A\{(j)^){x.,y) = 2/7rexp(— 2 z), and using 
the decomposition of z and Proposition |5|a| 



Now, let us discuss the computation of the DC decompo- 
sition for the inverse multiquadric mother function ?/) 



(1 



-y'^Y^ /i < 0. Notice that the decomposition of the 



terms x and y depend only on the PTM and dictionary models 
( p^ and ( p4| ); therefore, they are the same as in the previous 
case. Since ^(x,^) = (1 + zY is a convex function of z for 
z > 0, the decomposition of (l){x^y) can be obtained using 
Proposition |5|a| (A lthough the domain of the function g is M 
in Proposition |5|a[ an examination of the proof in | 3 1 shows 
that the property can still be applied for a convex function 
q defined on a domain that is a subset of R.) This gives the 
decomposition of [/a(^7)(0- The decomposition oiUx{(j)^)(l) 
can be similarly computed by applying Proposition |5|a| to the 
function ^) = (1 + z^^ . Then, the computation of E is 
the same as in the previous case. 
D. Proof of Proposition |J] 



E 



Proof: We rearrange E in the following form. 

(l+a7?r)IK-c™C/Ar('/'7'")ll' 



M Nrr 

M 



(37) 



-E E 

m=l {i,k)eR'^ 

Equivalently, one can write 



k,m 



where 



E ■ 



Nrr. 

E 



M 

E^" 

m=l 



(1 



E 

{i,k)eR" 

4]a 



From Proposition 4|a in order to show that E is 3. DC function 
of 7 and c, it is enough to show that ^^(7^,0^) is a DC 
function of 7^ and for all m = 1, ... , M. 

In the proof of Proposition [2] we have already shown that the 
squared norm of the difference vector e = v—cUx{(l)-f) is 3. DC 
function of c and 7, and we have described a way to compute 
its DC decomposition when a DC decomposition of the com- 
ponents of /7a(07) is available. Therefore, one can obtain the 
DC decompositions of the terms — /7a"^((/>7 
^^^k,m_^m u^^^^ ((^^^) IP in the formulation of ^^(7^, 

Then, ^^(7 ' 



^ and 

)• 



,c^) is a DC function of 7^ and as it is 
a linear combination of these terms, and its decomposition is 
simply given by Proposition 4|a 



